Group Number - 8
Group Member - Soumi Giri, Samreen Pathan, Vanessa Murillo, Prachi Rathod

Introduction

Overview

Taxicabs are the only vehicles that have the right to pick up street-hailing and prearranged passengers anywhere in New York City. By law, there are 13,587 taxis in New York City and each taxi must have a medallion affixed to it. The fare estimation majorly depends on the duration of each trip along with other factors. Analyzing these factors would help in estimating the duration better, in turn, estimating the fare better.

Business Problem

We will predict the trip duration of a taxi trip in New York City based on traffic, weather and time based features.

Dataset Overview

The data is from public data recourse: Kaggle and additional attributes were added from Kaggle.

The data set structure demonstrates that it has 42 features and around 1.5mn records of taxi trips across New York for the year 2016.

The predictor variables are described below:

  • id - a unique identifier for each trip
  • vendor_id : A code indicating the provider associated with the trip record
  • pickup_datetime : Date and time when the meter was engaged
  • dropoff_datetime : Date and time when the meter was disengaged
  • passenger_count : The number of passengers in the vehicle (driver entered value)
  • pickup_longitude : The longitude where the meter was engaged
  • pickup_latitude : The latitude where the meter was engaged
  • dropoff_longitude : The longitude where the meter was disengaged
  • dropoff_latitude : The latitude where the meter was disengaged
  • store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward
    N=not a store and forward trip
  • trip_duration : Duration of the trip in seconds
  • pickup_hour : The date and hour at which the meter was engaged
  • Temp. : The temperature(degree Celsius) at that particular hour in NYC
  • Windchill : An index value of the intersection between the actual temperature and wind speed
  • Heat.Index : A measure of intersection between relative humidity factored with the actual air temperature
  • Humidity : Measure of humidity
  • Pressure : Measure of pressure
  • Dew.Point : Measure of dew point
  • Visibility : Visibility index(directly proportional to visibility)
  • Wind.Dir : Direction of the wind
  • Wind.Speed : Speed of wind captured
  • Gust.Speed : Measure of sudden increase in wind speed above the average wind speed
  • Precip : Measure of precipitation
  • Events, conditions: Weather events or conditions observed in that timeframe
  • distance : Route distance in metres
  • duration : OSRM trip duration in seconds
  • motorway, trunk, primary, secondary, tertiary, unclassified, residential: The proportion spent on different kind of roads (% of total distance)
  • nTrafficSignals : The number of traffic signals.
  • nCrossing : The number of pedestrian crossing.
  • nStop : The number of stop signs.
  • nIntersection : The number of intersections (Intersection can be crossroad, but not a highway exit)
  • srcCounty : Pickup county
  • dstCounty : Dropoff county

Approach

  • We will be using the training dataset to train our model and test dataset to predict the trip distance

  • We will conduct data cleaning and perform an exploratory data analysis to generate some insights influence of the following factors on the trip distance:

  • Weather conditions

  • Traffic conditions

  • Trends over time/Seasonality

  • We will find plot the correlations and inspect the P values to get statistically significant features which will be used as the predictors

  • We plan to develop a Linear Regression in order to predict the trip duration

  • We will be using the RMSE and variance score as a metric to determine the performance of model in terms of prediction


Exploratory Data Analysis

Import Dataset and Library

For the New York Taxi Cab problem we are using the data provided in the Kaggle. Additional weather data has been included to account for the impact of weather conditions on trip duration. The weather data is NYC METAR data. A METAR weather report is predominantly used by pilots in fulfillment of a part of a pre-flight weather briefing, and by meteorologists, who use aggregated METAR information to assist in weather forecasting. Our data is aggregated information at hourly level for 2016.

To check the impact of various traffic related conditions we have additional imported OSRM data. OSRM stands for Open Source Routing Machine. It provides us with shortest paths in road networks and offers a multitude of functionality that is potentially useful when implementing maps/location based functionality. Our OSRM features had shortest distance and other road condition related features.

We. create our final data set with 42 features from the above 3 data sources. Our response variable is trip duration. For our analysis we have used the following libraries.

library(dplyr)
library(ggplot2)
library(Hmisc)
library(corrgram)
library(stats)
library(funModeling)
library(caret)
library(lubridate)

Once we have imported our necessary libraries we create our consolidated training data set.

### Importing our training data, weather data and OSRM data ###
train <- read.csv('train.csv')
weather <- read.csv('KNYC_Metars.csv')
osrm <- read.csv("train_augmented.csv")

### Creation of final dataset ###

##Adding an hourly column to the data
train$pickup_hour <- paste(substring(train$pickup_datetime,1,13), ":00:00", sep = "")

## Joining with weather data
final_weather <- left_join(train, weather, by = c("pickup_hour" = "Time") )

##Joining with OSRM data
final_data <- left_join(final_weather, osrm, by = "id")

## Creating a consolidated file for further analysis
write.csv(final_data, file = "Train_consolidated.csv")

Data Structure Analysis

### Importing final data set ##
train <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Train_consolidated.csv" , header = TRUE)

test <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/test_consolidate.csv" , header = TRUE)

head(train)
##   X        id vendor_id     pickup_datetime    dropoff_datetime passenger_count
## 1 1 id2875421         2 2016-03-14 17:24:55 2016-03-14 17:32:30               1
## 2 2 id2377394         1 2016-06-12 00:43:35 2016-06-12 00:54:38               1
## 3 3 id3858529         2 2016-01-19 11:35:24 2016-01-19 12:10:48               1
## 4 4 id3504673         2 2016-04-06 19:32:31 2016-04-06 19:39:40               1
## 5 5 id2181028         2 2016-03-26 13:30:55 2016-03-26 13:38:10               1
## 6 6 id0801584         2 2016-01-30 22:01:40 2016-01-30 22:09:03               6
##   pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
## 1        -73.98215        40.76794         -73.96463         40.76560
## 2        -73.98042        40.73856         -73.99948         40.73115
## 3        -73.97903        40.76394         -74.00533         40.71009
## 4        -74.01004        40.71997         -74.01227         40.70672
## 5        -73.97305        40.79321         -73.97292         40.78252
## 6        -73.98286        40.74220         -73.99208         40.74918
##   store_and_fwd_flag trip_duration         pickup_hour Temp. Windchill
## 1                  N           455 2016-03-14 17:00:00   4.4      -0.5
## 2                  N           663 2016-06-12 00:00:00  28.9        NA
## 3                  N          2124 2016-01-19 11:00:00  -6.7     -14.3
## 4                  N           429 2016-04-06 19:00:00   7.2       3.3
## 5                  N           435 2016-03-26 13:00:00   9.4        NA
## 6                  N           443 2016-01-30 22:00:00   3.9       0.2
##   Heat.Index Humidity Pressure Dew.Point Visibility Wind.Dir Wind.Speed
## 1         NA     0.86   1017.5       2.2        8.0      ENE       27.8
## 2       29.9     0.53   1006.6      18.3       16.1     West        7.4
## 3         NA     0.46   1016.3     -16.7       16.1     West       24.1
## 4         NA     0.39   1019.1      -6.1       16.1    South       25.9
## 5         NA     0.46   1026.9      -1.7       16.1 Variable        9.3
## 6         NA     0.41   1015.8      -8.3       16.1       SW       16.7
##   Gust.Speed Precip Events Conditions distance duration motorway   trunk
## 1       57.4    0.3   None   Overcast   2009.1    160.9        0 0.00000
## 2        0.0    0.0   None    Unknown   2513.4    256.5        0 0.00000
## 3       46.3    0.0   None      Clear   9910.7    679.6        0 0.54282
## 4       35.2    0.0   None      Clear   1779.1    181.8        0 0.00000
## 5        0.0    0.0   None      Clear   1615.0    132.2        0 0.00000
## 6       29.6    0.0   None      Clear   1393.6    142.9        0 0.00000
##   primary secondary  tertiary unclassified residential nTrafficSignals
## 1       0  0.000000 1.0000000     0.000000  0.00000000              14
## 2       0  0.348518 0.1747760     0.000000  0.14390300              25
## 3       0  0.372717 0.0398055     0.000000  0.00686116              38
## 4       0  0.000000 0.4244520     0.000000  0.03974140              18
## 5       0  0.637338 0.3626630     0.000000  0.00000000              17
## 6       0  0.000000 0.2383930     0.537783  0.22382500              12
##   nCrossing nStop nIntersection srcCounty dstCounty bug
## 1         5     0             4         1         1   0
## 2        13     0             0         1         1   0
## 3        12     0             3         1         1   0
## 4         6     0             1         1         1   0
## 5         2     0             2         1         1   0
## 6         3     0             0         1         1   0
str(train)
## 'data.frame':    1458644 obs. of  42 variables:
##  $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id                : chr  "id2875421" "id2377394" "id3858529" "id3504673" ...
##  $ vendor_id         : int  2 1 2 2 2 2 1 2 1 2 ...
##  $ pickup_datetime   : chr  "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
##  $ dropoff_datetime  : chr  "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
##  $ passenger_count   : int  1 1 1 1 1 6 4 1 1 1 ...
##  $ pickup_longitude  : num  -74 -74 -74 -74 -74 ...
##  $ pickup_latitude   : num  40.8 40.7 40.8 40.7 40.8 ...
##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude  : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ store_and_fwd_flag: chr  "N" "N" "N" "N" ...
##  $ trip_duration     : int  455 663 2124 429 435 443 341 1551 255 1225 ...
##  $ pickup_hour       : chr  "2016-03-14 17:00:00" "2016-06-12 00:00:00" "2016-01-19 11:00:00" "2016-04-06 19:00:00" ...
##  $ Temp.             : num  4.4 28.9 -6.7 7.2 9.4 3.9 18.9 13.9 25.6 23.9 ...
##  $ Windchill         : num  -0.5 NA -14.3 3.3 NA 0.2 NA NA NA NA ...
##  $ Heat.Index        : num  NA 29.9 NA NA NA NA NA NA NA NA ...
##  $ Humidity          : num  0.86 0.53 0.46 0.39 0.46 0.41 0.54 0.55 0.6 0.38 ...
##  $ Pressure          : num  1018 1007 1016 1019 1027 ...
##  $ Dew.Point         : num  2.2 18.3 -16.7 -6.1 -1.7 -8.3 9.4 5 17.2 8.9 ...
##  $ Visibility        : num  8 16.1 16.1 16.1 16.1 16.1 16.1 16.1 12.9 16.1 ...
##  $ Wind.Dir          : chr  "ENE" "West" "West" "South" ...
##  $ Wind.Speed        : num  27.8 7.4 24.1 25.9 9.3 16.7 7.4 0 9.3 7.4 ...
##  $ Gust.Speed        : num  57.4 0 46.3 35.2 0 29.6 0 0 0 0 ...
##  $ Precip            : num  0.3 0 0 0 0 0 0 0 0 0 ...
##  $ Events            : chr  "None" "None" "None" "None" ...
##  $ Conditions        : chr  "Overcast" "Unknown" "Clear" "Clear" ...
##  $ distance          : num  2009 2513 9911 1779 1615 ...
##  $ duration          : num  161 256 680 182 132 ...
##  $ motorway          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trunk             : num  0 0 0.543 0 0 ...
##  $ primary           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ secondary         : num  0 0.349 0.373 0 0.637 ...
##  $ tertiary          : num  1 0.1748 0.0398 0.4245 0.3627 ...
##  $ unclassified      : num  0 0 0 0 0 ...
##  $ residential       : num  0 0.1439 0.00686 0.03974 0 ...
##  $ nTrafficSignals   : int  14 25 38 18 17 12 19 33 9 58 ...
##  $ nCrossing         : int  5 13 12 6 2 3 0 18 2 11 ...
##  $ nStop             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nIntersection     : int  4 0 3 1 2 0 2 23 0 6 ...
##  $ srcCounty         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dstCounty         : int  1 1 1 1 1 1 1 4 1 1 ...
##  $ bug               : int  0 0 0 0 0 0 0 0 0 0 ...
summary(train)
##        X                id              vendor_id     pickup_datetime   
##  Min.   :      1   Length:1458644     Min.   :1.000   Length:1458644    
##  1st Qu.: 364662   Class :character   1st Qu.:1.000   Class :character  
##  Median : 729323   Mode  :character   Median :2.000   Mode  :character  
##  Mean   : 729323                      Mean   :1.535                     
##  3rd Qu.:1093983                      3rd Qu.:2.000                     
##  Max.   :1458644                      Max.   :2.000                     
##                                                                         
##  dropoff_datetime   passenger_count pickup_longitude  pickup_latitude
##  Length:1458644     Min.   :0.000   Min.   :-121.93   Min.   :34.36  
##  Class :character   1st Qu.:1.000   1st Qu.: -73.99   1st Qu.:40.74  
##  Mode  :character   Median :1.000   Median : -73.98   Median :40.75  
##                     Mean   :1.665   Mean   : -73.97   Mean   :40.75  
##                     3rd Qu.:2.000   3rd Qu.: -73.97   3rd Qu.:40.77  
##                     Max.   :9.000   Max.   : -61.34   Max.   :51.88  
##                                                                      
##  dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration    
##  Min.   :-121.93   Min.   :32.18    Length:1458644     Min.   :      1  
##  1st Qu.: -73.99   1st Qu.:40.74    Class :character   1st Qu.:    397  
##  Median : -73.98   Median :40.75    Mode  :character   Median :    662  
##  Mean   : -73.97   Mean   :40.75                       Mean   :    959  
##  3rd Qu.: -73.96   3rd Qu.:40.77                       3rd Qu.:   1075  
##  Max.   : -61.34   Max.   :43.92                       Max.   :3526282  
##                                                                         
##  pickup_hour            Temp.          Windchill        Heat.Index     
##  Length:1458644     Min.   :-18.30   Min.   :-28.4    Min.   :26.6     
##  Class :character   1st Qu.:  3.90   1st Qu.: -4.5    1st Qu.:27.4     
##  Mode  :character   Median : 10.60   Median : -0.5    Median :27.8     
##                     Mean   : 11.06   Mean   : -1.5    Mean   :28.4     
##                     3rd Qu.: 18.00   3rd Qu.:  2.6    3rd Qu.:29.1     
##                     Max.   : 32.20   Max.   :  7.1    Max.   :33.6     
##                     NA's   :12032    NA's   :977900   NA's   :1412890  
##     Humidity        Pressure        Dew.Point         Visibility   
##  Min.   :0.090   Min.   : 989.5   Min.   :-28.300   Min.   : 0.40  
##  1st Qu.:0.380   1st Qu.:1010.2   1st Qu.: -6.100   1st Qu.:14.50  
##  Median :0.490   Median :1015.9   Median :  1.100   Median :16.10  
##  Mean   :0.521   Mean   :1015.6   Mean   :  0.644   Mean   :14.69  
##  3rd Qu.:0.650   3rd Qu.:1020.9   3rd Qu.:  8.300   3rd Qu.:16.10  
##  Max.   :1.000   Max.   :1036.8   Max.   : 20.600   Max.   :16.10  
##  NA's   :12032   NA's   :56031    NA's   :12032     NA's   :56597  
##    Wind.Dir           Wind.Speed        Gust.Speed        Precip      
##  Length:1458644     Min.   :  0.000   Min.   : 0.00   Min.   : 0.000  
##  Class :character   1st Qu.:  5.600   1st Qu.: 0.00   1st Qu.: 0.000  
##  Mode  :character   Median :  9.300   Median : 0.00   Median : 0.000  
##                     Mean   :  8.962   Mean   : 8.01   Mean   : 0.068  
##                     3rd Qu.: 13.000   3rd Qu.: 0.00   3rd Qu.: 0.000  
##                     Max.   :137.000   Max.   :72.20   Max.   :11.900  
##                     NA's   :12032     NA's   :12032   NA's   :12032   
##     Events           Conditions           distance          duration      
##  Length:1458644     Length:1458644     Min.   :      0   Min.   :    0.0  
##  Class :character   Class :character   1st Qu.:   1654   1st Qu.:  155.4  
##  Mode  :character   Mode  :character   Median :   2730   Median :  249.2  
##                                        Mean   :   4569   Mean   :  343.5  
##                                        3rd Qu.:   5049   3rd Qu.:  424.0  
##                                        Max.   :1038310   Max.   :46645.3  
##                                        NA's   :1         NA's   :1        
##     motorway             trunk          primary          secondary     
##  Min.   :0.0000000   Min.   :0.000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000000   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000000   Median :0.000   Median :0.00000   Median :0.1314  
##  Mean   :0.0004421   Mean   :0.126   Mean   :0.03019   Mean   :0.2665  
##  3rd Qu.:0.0000000   3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:0.4580  
##  Max.   :1.0000000   Max.   :1.000   Max.   :1.00000   Max.   :1.0000  
##  NA's   :1           NA's   :1       NA's   :1         NA's   :1       
##     tertiary        unclassified      residential      nTrafficSignals
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :  0.0  
##  1st Qu.:0.08329   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 14.0  
##  Median :0.34729   Median :0.00000   Median :0.00000   Median : 21.0  
##  Mean   :0.41943   Mean   :0.01338   Mean   :0.04399   Mean   : 24.5  
##  3rd Qu.:0.75704   3rd Qu.:0.00000   3rd Qu.:0.02493   3rd Qu.: 32.0  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :112.0  
##  NA's   :1         NA's   :1         NA's   :1         NA's   :1      
##    nCrossing          nStop         nIntersection      srcCounty    
##  Min.   : 0.000   Min.   : 0.0000   Min.   :  0.00   Min.   :1.000  
##  1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.:  0.00   1st Qu.:1.000  
##  Median : 6.000   Median : 0.0000   Median :  1.00   Median :1.000  
##  Mean   : 8.812   Mean   : 0.1153   Mean   :  2.52   Mean   :1.214  
##  3rd Qu.:13.000   3rd Qu.: 0.0000   3rd Qu.:  3.00   3rd Qu.:1.000  
##  Max.   :91.000   Max.   :10.0000   Max.   :132.00   Max.   :5.000  
##  NA's   :1        NA's   :1         NA's   :1        NA's   :1228   
##    dstCounty          bug          
##  Min.   :1.000   Min.   :0.000000  
##  1st Qu.:1.000   1st Qu.:0.000000  
##  Median :1.000   Median :0.000000  
##  Mean   :1.271   Mean   :0.003143  
##  3rd Qu.:1.000   3rd Qu.:0.000000  
##  Max.   :5.000   Max.   :3.000000  
##  NA's   :6339

We now check the distribution of our response variable trip duration.

## Distribution of response variable
attach(train)
hist(trip_duration)

The response variable trip duration is given in seconds in the data and we observe there are few outlier values in the data from histogram. We also observe that trip duration is highly skewed to the left.

Now we compare the structure of our test data with our train data to understand is the train data distribution can be compared with the test for accurate prediction of trip duration

train1<- train[, !colnames(train) %in% c("bug")]
test1<- test[, !colnames(test) %in% c("Day")]

train1$set <- "train"
test1$set  <- "test"
train_col <- colnames(train1)

test1$trip_duration  <- NA
test1$dropoff_datetime <- NA

test2 <- test1[,train_col]
full <- rbind(train1, test1)
obs_day <- full %>% 
  mutate(
    pickup_date  = ymd_hms(pickup_datetime),
    pickup_date  = date(pickup_datetime)
  ) %>%
  group_by(set, pickup_date) %>%
  summarise(
    obs = n()
  )

ggplot(obs_day, aes(x=pickup_date, y=obs, group=set, colour=set)) +
  geom_line() +
  facet_grid(set~.) +
  theme_minimal()

As seen from the plot above, the distribution of train and test data is very similar. For this reason, we have identified the outliers but did not treat them as the treatment would have impacted our model’s prediction on the test data.

Missing Value treatment

From our above summary we observe that there are few NULL values present in our weather variables - Temp., Humidity, Pressure, Dew.Point, Visibility, Wind.Speed, Gust.Speed, Precip, Events, Condition. For the missing values in the continuous variables in the weather data we replace the values with mean of the month and for the categorical variables we replace the missing values with mode.

library(dplyr)

weather <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/KNYC_Metars.csv")

## Creating a function in R calculate mode for categorical variables ##
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

## Calculating monthly average for weather variables ##
final_weather <- weather %>% mutate(Date_avg = substring(Time,1,10)) %>% 
                  group_by(Date_avg) %>% 
                  summarise(Temp_mean = mean(Temp.), Humidity_mean = mean(Humidity), Press_mean = mean(Pressure),
                            Dew_mean = mean(Dew.Point), Visibility_mean = mean(Visibility), Winddir_mode =   getmode(Wind.Dir),
                            Windspeed_mean = mean(Wind.Speed), Gustsp_mean = mean(Gust.Speed), Precip_mean = mean(Precip),
                            Events_mode = getmode(Events), Conditions_mode = getmode(Conditions)) %>% 
                  select(Date_avg, Temp_mean, Humidity_mean, Press_mean, Dew_mean, Visibility_mean, Winddir_mode, 
                         Windspeed_mean, Gustsp_mean, Precip_mean, Events_mode, Conditions_mode)

## Extracting month date in our train data ##
train$Day <- substring(train$pickup_datetime, 1, 10)

#Replacing all NULL weather values in train data with mean and mode values

train_colnames <- colnames(train)

train_natreat <- left_join(train, final_weather, by = c("Day" = "Date_avg"))

train_natreat$Temp.[is.na(train_natreat$Temp.)] <- train_natreat$Temp_mean[is.na(train_natreat$Temp.)]
train_natreat$Humidity[is.na(train_natreat$Humidity)] <- train_natreat$Humidity_mean[is.na(train_natreat$Humidity)]
train_natreat$Pressure[is.na(train_natreat$Pressure)] <- train_natreat$Press_mean[is.na(train_natreat$Pressure)]
train_natreat$Dew.Point[is.na(train_natreat$Dew.Point)] <- train_natreat$Dew_mean[is.na(train_natreat$Dew.Point)]
train_natreat$Visibility[is.na(train_natreat$Visibility)] <- train_natreat$Visibility_mean[is.na(train_natreat$Visibility)]
train_natreat$Wind.Dir[is.na(train_natreat$Wind.Dir)] <- train_natreat$Winddir_mode[is.na(train_natreat$Wind.Dir)]
train_natreat$Wind.Speed[is.na(train_natreat$Wind.Speed)] <- train_natreat$Windspeed_mean[is.na(train_natreat$Wind.Speed)]
train_natreat$Gust.Speed[is.na(train_natreat$Gust.Speed)] <- train_natreat$Gustsp_mean[is.na(train_natreat$Gust.Speed)]
train_natreat$Precip[is.na(train_natreat$Precip)] <- train_natreat$Precip_mean[is.na(train_natreat$Precip)]
train_natreat$Events[is.na(train_natreat$Events)] <- train_natreat$Events_mode[is.na(train_natreat$Events)]
train_natreat$Conditions[is.na(train_natreat$Conditions)] <- train_natreat$Conditions_mode[is.na(train_natreat$Conditions)]


train <- train_natreat %>% select(all_of(train_colnames))

remove(train_natreat)

Finally we also have 1 NULL value in our distance variable that can be observed from the summary statistics provided above. We remove this observation from our final data.

## Removing the row with distance as NULL ##
train <- train %>% filter(is.na(distance) == FALSE)

Duplicate record checks

In our data each trip is identified by a unique ID column. We check if our id has any duplicates in them.

## Duplicate check ##
length(unique(train$id))
## [1] 1458643

From the above check we can conclude our data doesn’t have any duplicate values.

Outlier Study

we now check our trip duration variable. As our trip duration variable is skewed we take log transformation.

hist(log(train$trip_duration),
     main = "Distribution of log transformation of trip duration")

We check our distance variable for outliers.

## Checking distribution of distance ##
quantile(train$distance, c(.10, .20, .30, .40, .50, .60, .70, .80, .90, 1), na.rm = TRUE)
##        10%        20%        30%        40%        50%        60%        70% 
##    1083.20    1472.24    1840.30    2246.70    2729.70    3361.40    4315.70 
##        80%        90%       100% 
##    6108.40   10451.10 1038310.00

From this we can identify there are outliers at greater than 99 percentile. We don’t remove these outliers as our test data also has these values, if we remove these from our training data the model will not be able to predict.

Variable correlation

To understand the impact of our variables on our response variable trip duration view the correlation matrix for the variable. We have divided our predictor variables in 3 categories:

  1. Weather information
  2. Traffic related information
  3. Other factors impacting trip duration

Weather variables

We have 13 weather related features in our data. Among our 13 features we have 2 categorical features and 11 continuous features. We check the correlation with our 11 continuous features. Among the 11 features we have removed Windchill and Heat Index as these columns have high numbers of NULL which we are not able to impute

#Weather
library(corrgram)

weather_cont %>% corrgram()

Traffic Variables

Evaluating the correlation between traffic variables.

#Traffic data

train <- train %>% mutate(trip_duration_minutes = trip_duration / 60)
traffic_data <- train[, c("nTrafficSignals", "nCrossing", "nStop", "nIntersection", "trip_duration_minutes")]
# checking for any missing values in the above variables of the data
sapply(traffic_data, function(x) sum(is.na(x)))
#replacing NA values with zeros
traffic_data[is.na(traffic_data)] <- 0
#checking the NA values in the data frame again
sapply(traffic_data, function(x) sum(is.na(x)))


#calculating the correlation graph
corrgram(traffic_data)

From the above graph we can observe that there has is a high positive correlation between No. of traffic signals and No. of Pedestrians crossings. There appears to be a slight strong correlation between No. of Traffic signals and No. of Intersections as well. We can observe that there is a weak correlation between rest of the variables.

Road Conditions

We are now going to observe the correlation between variables so as to understand if any of these variables should be excluded from our model.

corrgram::corrgram(clean_road_data)

As illustrated above, correlation is fairly weak across the board. The most salient being a coefficient of -0.54 between secondary and tertiary roads. In practical terms this makes sense as trip routes taken will likely either be by using tertiary or secondary roads. From this analyses we can conclude all road-type variables should be considered for modeling.

Consolidated correlation analysis

We studies the correlation between all our predictor variables in order to eliminate any highly correlated variables. The variables with correlation value > 0.7 are eliminated.

library(Hmisc)
Cor_cont <- train %>% select(passenger_count, Temp.,Windchill,Heat.Index,Humidity,Pressure,Dew.Point,Visibility,Wind.Speed,Gust.Speed,Precip,distance,nTrafficSignals,nCrossing,nStop,nIntersection)
Cor_all <- rcorr(as.matrix(Cor_cont))
## Warning in sqrt(npair - 2): NaNs produced
Cor_all
##                 passenger_count Temp. Windchill Heat.Index Humidity Pressure
## passenger_count            1.00  0.00      0.00       0.01     0.00     0.00
## Temp.                      0.00  1.00      0.98       0.91     0.00    -0.22
## Windchill                  0.00  0.98      1.00         NA     0.24    -0.20
## Heat.Index                 0.01  0.91        NA       1.00     0.13     0.04
## Humidity                   0.00  0.00      0.24       0.13     1.00    -0.34
## Pressure                   0.00 -0.22     -0.20       0.04    -0.34     1.00
## Dew.Point                  0.00  0.84      0.74       0.59     0.53    -0.37
## Visibility                 0.00 -0.09     -0.06      -0.35    -0.61     0.16
## Wind.Speed                 0.00 -0.27     -0.29       0.03    -0.03    -0.10
## Gust.Speed                 0.00 -0.28     -0.33       0.14    -0.08    -0.17
## Precip                     0.00 -0.02      0.05        NaN     0.28    -0.11
## distance                   0.01  0.01      0.01       0.00     0.01     0.00
## nTrafficSignals            0.01  0.01      0.02       0.00     0.00     0.00
## nCrossing                  0.01  0.00      0.01       0.01     0.00     0.00
## nStop                      0.00  0.00      0.00       0.01     0.01     0.00
## nIntersection              0.00  0.00      0.01       0.00     0.01     0.00
##                 Dew.Point Visibility Wind.Speed Gust.Speed Precip distance
## passenger_count      0.00       0.00       0.00       0.00   0.00     0.01
## Temp.                0.84      -0.09      -0.27      -0.28  -0.02     0.01
## Windchill            0.74      -0.06      -0.29      -0.33   0.05     0.01
## Heat.Index           0.59      -0.35       0.03       0.14    NaN     0.00
## Humidity             0.53      -0.61      -0.03      -0.08   0.28     0.01
## Pressure            -0.37       0.16      -0.10      -0.17  -0.11     0.00
## Dew.Point            1.00      -0.36      -0.25      -0.29   0.11     0.02
## Visibility          -0.36       1.00      -0.04       0.02  -0.43     0.01
## Wind.Speed          -0.25      -0.04       1.00       0.60   0.07    -0.01
## Gust.Speed          -0.29       0.02       0.60       1.00   0.05    -0.01
## Precip               0.11      -0.43       0.07       0.05   1.00     0.00
## distance             0.02       0.01      -0.01      -0.01   0.00     1.00
## nTrafficSignals      0.01       0.02      -0.01      -0.01  -0.01     0.23
## nCrossing            0.01       0.01      -0.01      -0.01   0.00     0.10
## nStop                0.00       0.00       0.00       0.00   0.00     0.07
## nIntersection        0.01       0.01      -0.01      -0.01   0.00     0.44
##                 nTrafficSignals nCrossing nStop nIntersection
## passenger_count            0.01      0.01  0.00          0.00
## Temp.                      0.01      0.00  0.00          0.00
## Windchill                  0.02      0.01  0.00          0.01
## Heat.Index                 0.00      0.01  0.01          0.00
## Humidity                   0.00      0.00  0.01          0.01
## Pressure                   0.00      0.00  0.00          0.00
## Dew.Point                  0.01      0.01  0.00          0.01
## Visibility                 0.02      0.01  0.00          0.01
## Wind.Speed                -0.01     -0.01  0.00         -0.01
## Gust.Speed                -0.01     -0.01  0.00         -0.01
## Precip                    -0.01      0.00  0.00          0.00
## distance                   0.23      0.10  0.07          0.44
## nTrafficSignals            1.00      0.56  0.08          0.16
## nCrossing                  0.56      1.00  0.14          0.03
## nStop                      0.08      0.14  1.00          0.03
## nIntersection              0.16      0.03  0.03          1.00
## 
## n
##                 passenger_count   Temp. Windchill Heat.Index Humidity Pressure
## passenger_count         1458643 1458643    480744      45754  1458643  1410341
## Temp.                   1458643 1458643    480744      45754  1458643  1410341
## Windchill                480744  480744    480744          0   480744   480744
## Heat.Index                45754   45754         0      45754    45754    45754
## Humidity                1458643 1458643    480744      45754  1458643  1410341
## Pressure                1410341 1410341    480744      45754  1410341  1410341
## Dew.Point               1458643 1458643    480744      45754  1458643  1410341
## Visibility              1409775 1409775    480744      45754  1409775  1409775
## Wind.Speed              1458643 1458643    480744      45754  1458643  1410341
## Gust.Speed              1458643 1458643    480744      45754  1458643  1410341
## Precip                  1458643 1458643    480744      45754  1458643  1410341
## distance                1458643 1458643    480744      45754  1458643  1410341
## nTrafficSignals         1458643 1458643    480744      45754  1458643  1410341
## nCrossing               1458643 1458643    480744      45754  1458643  1410341
## nStop                   1458643 1458643    480744      45754  1458643  1410341
## nIntersection           1458643 1458643    480744      45754  1458643  1410341
##                 Dew.Point Visibility Wind.Speed Gust.Speed  Precip distance
## passenger_count   1458643    1409775    1458643    1458643 1458643  1458643
## Temp.             1458643    1409775    1458643    1458643 1458643  1458643
## Windchill          480744     480744     480744     480744  480744   480744
## Heat.Index          45754      45754      45754      45754   45754    45754
## Humidity          1458643    1409775    1458643    1458643 1458643  1458643
## Pressure          1410341    1409775    1410341    1410341 1410341  1410341
## Dew.Point         1458643    1409775    1458643    1458643 1458643  1458643
## Visibility        1409775    1409775    1409775    1409775 1409775  1409775
## Wind.Speed        1458643    1409775    1458643    1458643 1458643  1458643
## Gust.Speed        1458643    1409775    1458643    1458643 1458643  1458643
## Precip            1458643    1409775    1458643    1458643 1458643  1458643
## distance          1458643    1409775    1458643    1458643 1458643  1458643
## nTrafficSignals   1458643    1409775    1458643    1458643 1458643  1458643
## nCrossing         1458643    1409775    1458643    1458643 1458643  1458643
## nStop             1458643    1409775    1458643    1458643 1458643  1458643
## nIntersection     1458643    1409775    1458643    1458643 1458643  1458643
##                 nTrafficSignals nCrossing   nStop nIntersection
## passenger_count         1458643   1458643 1458643       1458643
## Temp.                   1458643   1458643 1458643       1458643
## Windchill                480744    480744  480744        480744
## Heat.Index                45754     45754   45754         45754
## Humidity                1458643   1458643 1458643       1458643
## Pressure                1410341   1410341 1410341       1410341
## Dew.Point               1458643   1458643 1458643       1458643
## Visibility              1409775   1409775 1409775       1409775
## Wind.Speed              1458643   1458643 1458643       1458643
## Gust.Speed              1458643   1458643 1458643       1458643
## Precip                  1458643   1458643 1458643       1458643
## distance                1458643   1458643 1458643       1458643
## nTrafficSignals         1458643   1458643 1458643       1458643
## nCrossing               1458643   1458643 1458643       1458643
## nStop                   1458643   1458643 1458643       1458643
## nIntersection           1458643   1458643 1458643       1458643
## 
## P
##                 passenger_count Temp.  Windchill Heat.Index Humidity Pressure
## passenger_count                 0.0876 0.0319    0.0016     0.0000   0.0000  
## Temp.           0.0876                 0.0000    0.0000     0.0000   0.0000  
## Windchill       0.0319          0.0000                      0.0000   0.0000  
## Heat.Index      0.0016          0.0000                      0.0000   0.0000  
## Humidity        0.0000          0.0000 0.0000    0.0000              0.0000  
## Pressure        0.0000          0.0000 0.0000    0.0000     0.0000           
## Dew.Point       0.1805          0.0000 0.0000    0.0000     0.0000   0.0000  
## Visibility      0.0000          0.0000 0.0000    0.0000     0.0000   0.0000  
## Wind.Speed      0.2683          0.0000 0.0000    0.0000     0.0000   0.0000  
## Gust.Speed      0.0544          0.0000 0.0000    0.0000     0.0000   0.0000  
## Precip          0.6320          0.0000 0.0000               0.0000   0.0000  
## distance        0.0000          0.0000 0.0000    0.8567     0.0000   0.0463  
## nTrafficSignals 0.0000          0.0000 0.0000    0.6446     0.0000   0.1450  
## nCrossing       0.0000          0.0000 0.0000    0.0051     0.5581   0.0199  
## nStop           0.0068          0.2629 0.6635    0.2144     0.0000   0.0139  
## nIntersection   0.0030          0.1234 0.0000    0.4869     0.0000   0.0001  
##                 Dew.Point Visibility Wind.Speed Gust.Speed Precip distance
## passenger_count 0.1805    0.0000     0.2683     0.0544     0.6320 0.0000  
## Temp.           0.0000    0.0000     0.0000     0.0000     0.0000 0.0000  
## Windchill       0.0000    0.0000     0.0000     0.0000     0.0000 0.0000  
## Heat.Index      0.0000    0.0000     0.0000     0.0000            0.8567  
## Humidity        0.0000    0.0000     0.0000     0.0000     0.0000 0.0000  
## Pressure        0.0000    0.0000     0.0000     0.0000     0.0000 0.0463  
## Dew.Point                 0.0000     0.0000     0.0000     0.0000 0.0000  
## Visibility      0.0000               0.0000     0.0000     0.0000 0.0000  
## Wind.Speed      0.0000    0.0000                0.0000     0.0000 0.0000  
## Gust.Speed      0.0000    0.0000     0.0000                0.0000 0.0000  
## Precip          0.0000    0.0000     0.0000     0.0000            0.0000  
## distance        0.0000    0.0000     0.0000     0.0000     0.0000         
## nTrafficSignals 0.0000    0.0000     0.0000     0.0000     0.0000 0.0000  
## nCrossing       0.0000    0.0000     0.0000     0.0000     0.0000 0.0000  
## nStop           0.0000    0.0394     0.0028     0.0000     0.6320 0.0000  
## nIntersection   0.0000    0.0000     0.0000     0.0000     0.0000 0.0000  
##                 nTrafficSignals nCrossing nStop  nIntersection
## passenger_count 0.0000          0.0000    0.0068 0.0030       
## Temp.           0.0000          0.0000    0.2629 0.1234       
## Windchill       0.0000          0.0000    0.6635 0.0000       
## Heat.Index      0.6446          0.0051    0.2144 0.4869       
## Humidity        0.0000          0.5581    0.0000 0.0000       
## Pressure        0.1450          0.0199    0.0139 0.0001       
## Dew.Point       0.0000          0.0000    0.0000 0.0000       
## Visibility      0.0000          0.0000    0.0394 0.0000       
## Wind.Speed      0.0000          0.0000    0.0028 0.0000       
## Gust.Speed      0.0000          0.0000    0.0000 0.0000       
## Precip          0.0000          0.0000    0.6320 0.0000       
## distance        0.0000          0.0000    0.0000 0.0000       
## nTrafficSignals                 0.0000    0.0000 0.0000       
## nCrossing       0.0000                    0.0000 0.0000       
## nStop           0.0000          0.0000           0.0000       
## nIntersection   0.0000          0.0000    0.0000

Based on the correlation coefficients obtained above, we have removed the following features from further analysis as they were highly correlated (coefficient >0.7): - Windchill(0.97 with Temp) - Heat Index (0.90 with Temp) - Dew Point(0.83 with Temp)


Feature Selection And Engineering

Based on our above correlation analysis we select our final features for trip duration prediction. Before we are able to feed the selected features in our model we will perform few pre processing to encode our categorical variables. We use One hot encoding to convert our categorical variable for our model. In One hot encoding each category in a variable is transformed into a column with binary value of 0 and 1.

We perform all these pre-processing steps on both our train and test data set to prepare for model input and prediction.

## Final which we require for our model to train##
train_model <- train %>% mutate(pickup_hour = format(as.POSIXct(train$pickup_datetime,format="%Y-%m-%d %H:%M:%S"),"%H"),
                                pickup_date = as.POSIXct(substring(train$pickup_datetime,1,10),format="%Y-%m-%d"),
                                pickup_month = month(as.POSIXct(substring(train$pickup_datetime,1,10),format="%Y-%m-%d")),
                                pickup_dayofweek = weekdays(as.POSIXct(train$pickup_datetime,format="%Y-%m-%d %H:%M:%S")),
                                pickup_Day = format(train$pickup_date, format = "%d"),
                                pickup_latitude3 = round(train$pickup_latitude, digits = 3),
                                pickup_longitude3 = round(train$pickup_longitude, digits = 3),
                                dropoff_latitude3 = round(train$dropoff_latitude, digits = 3),
                                dropoff_longitude3 = round(train$dropoff_longitude, digits = 3)
)

## Adding fields which we require for our model to test
test_model <- test %>% mutate(pickup_hour = format(as.POSIXct(test$pickup_datetime,format="%Y-%m-%d %H:%M:%S"),"%H"),
                              pickup_date = as.POSIXct(substring(test$pickup_datetime,1,10),format="%Y-%m-%d"),
                              pickup_month = month(as.POSIXct(substring(test$pickup_datetime,1,10),format="%Y-%m-%d")),
                              pickup_dayofweek = weekdays(as.POSIXct(test$pickup_datetime,format="%Y-%m-%d %H:%M:%S")),
                              pickup_Day = format(test$pickup_date, format = "%d"),
                              pickup_latitude3 = round(test$pickup_latitude, digits = 3),
                              pickup_longitude3 = round(test$pickup_longitude, digits = 3),
                              dropoff_latitude3 = round(test$dropoff_latitude, digits = 3),
                              dropoff_longitude3 = round(test$dropoff_longitude, digits = 3)
)

write.csv(train_model, file = "train_final_model.csv")
write.csv(test_model, file = "test_final_model.csv")

Now we have added our additional features to the data we can move to encoding the categorical variables and creating the final data for model.

## Adding fields which we require for our model to train##
train_model <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Final submission/train_final_model.csv")

#Changing to factors for encoding
train_model$vendor_id <- factor(train_model$vendor_id, exclude = NULL)
train_model$store_and_fwd_flag <- factor(train_model$store_and_fwd_flag, exclude = NULL)
train_model$Conditions <- factor(train_model$Conditions, exclude = NULL)
train_model$pickup_hour <- factor(train_model$pickup_hour, exclude = NULL)
train_model$pickup_dayofweek <- factor(train_model$pickup_dayofweek, exclude = NULL)
train_model$pickup_month <- factor(train_model$pickup_month, exclude = NULL)
train_model$pickup_Day <- factor(train_model$pickup_Day, exclude = NULL)
train_model$dstCounty <- factor(train_model$dstCounty, exclude = NULL)
train_model$srcCounty <- factor(train_model$dstCounty, exclude = NULL)

## Adding fields which we require for our model to test
test_model <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Final submission/test_final_model.csv")

#Changing to factor for encoding
test_model$vendor_id <- factor(test_model$vendor_id, exclude = NULL)
test_model$store_and_fwd_flag <- factor(test_model$store_and_fwd_flag, exclude = NULL)
test_model$Conditions <- factor(test_model$Conditions, exclude = NULL)
test_model$pickup_hour <- factor(test_model$pickup_hour, exclude = NULL)
test_model$pickup_dayofweek <- factor(test_model$pickup_dayofweek, exclude = NULL)
test_model$pickup_month <- factor(test_model$pickup_month, exclude = NULL)
test_model$pickup_Day <- factor(test_model$pickup_Day, exclude = NULL)
test_model$dstCounty <- factor(test_model$dstCounty, exclude = NULL)
test_model$srcCounty <- factor(test_model$dstCounty, exclude = NULL)


#### Encoding categorical variables in train
en_train <- train_model

## Changing NA values to factor for encoding
en_train$dstCounty <- addNA(en_train$dstCounty)
en_train$srcCounty <- addNA(en_train$srcCounty)

## One hot encoding

dv_train <- caret::dummyVars(" ~ vendor_id + store_and_fwd_flag + Conditions + pickup_hour + pickup_dayofweek + pickup_month + pickup_Day + dstCounty + srcCounty", 
                             data = en_train) 

en_train <- data.frame(predict(dv_train, newdata = train_model))

## Dropping NA values for pickup_hour, pickup_dayofweek and pickup_Day

en_train <- en_train[,!colnames(en_train) %in% c("pickup_hour.NA", "pickup_dayofweek.NA", "pickup_Day.NA")]

#### Creating final dataset for the model training

## Response variable
Y <- train_model$log_trip_duration

##Predictor variables
cont_cols <- c("passenger_count", "Temp.", "Humidity", "Wind.Speed", "Gust.Speed", "Precip", "distance",
               "motorway", "trunk", "primary", "secondary", "tertiary", "unclassified", "residential", "nTrafficSignals","nCrossing", "nStop", "nIntersection")
X_cont <-  train_model[,cont_cols]
x_dis <- en_train

X <- cbind(X_cont,x_dis)

train_final <- cbind(Y, X)

## Encoding categorical variables in test

en_test <- test_model

# Changing NA values to factor levels
en_test$dstCounty <- addNA(en_test$dstCounty)
en_test$srcCounty <- addNA(en_test$srcCounty)

#One hot encoding
dv_test <- caret::dummyVars(" ~ vendor_id + store_and_fwd_flag + Conditions + pickup_hour + pickup_dayofweek + pickup_month + pickup_Day + dstCounty + srcCounty", 
                             data = en_test) 
en_test <- data.frame(predict(dv_test, newdata = test_model))

## Creating final dataset for prediction

cont_cols_pred <- c("passenger_count", "Temp.", "Humidity", "Wind.Speed", "Gust.Speed", "Precip", "distance",
               "motorway", "trunk", "primary", "secondary", "tertiary", "unclassified", "residential", "nTrafficSignals",
               "nCrossing", "nStop", "nIntersection")
X_cont_pred <-  test_model[,cont_cols]
x_dis_pred <- en_test

X_pred <- cbind(X_cont_pred,x_dis_pred)

After our feature engineering steps we now have 118 feature for our model. Before inputing our train data set to our model we split our data in 70:30 ratio to assess our model performance.

### Split train to 70/30 to assess model performance

set.seed(123)
train_split <- sample(nrow(train_final), 0.70*nrow(train_final))

model_train <- train_final[train_split,]
model_test <- train_final[-train_split,]

Model Creation

To predict the taxi trip duration in our test data set we have selected linear regression algorithm. Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, that y can be calculated from a linear combination of the input variables (x).

In our case our input variable x is our 118 features that we created in the above steps and our output y variable is log transformation of trip duration. Our model is given by.

lm_model1 <- lm(Y ~ ., data = model_train)

Test Data Validation

Now to test the performance of our model we check our out of sample performance of 30% data split we created previously from our training dataset.

OOS_lm_model1 <- predict(lm_model1, model_test)
## Warning in predict.lm(lm_model1, model_test): prediction from a rank-deficient
## fit may be misleading
### Defining our model assessment function
calc_performance <- function(actual, pred) {
  
  rmse <- sqrt(mean((actual - pred)**2))
  mae <- mean(abs(actual - pred))
  mape <- mean(abs((actual-pred)/actual))
  
  retvals <- list(rmse = rmse, mae = mae, mape = mape)
  return(retvals)
  
}


### Our model performance metrics
OOS_lm <- calc_performance(model_test$Y, OOS_lm_model1)

OOS_lm
## $rmse
## [1] 0.5231223
## 
## $mae
## [1] 0.3588577
## 
## $mape
## [1] Inf

Our model gives a RMSE of 0.523.

Now we run our model on our test data set to predict the trip duration.

model_output <- predict(lm_model1, X_pred)

## taking inverse log transformation to get actual trip duration

output <- exp(model_output)

final_result <- cbind(test$id, output)

write.csv(final_result, file = "C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Final submission/final_prediction.csv")

Conclusion

Our final linear regression model is able to predict our trip duration values in our test data. From our detailed exploration of the different factors - weather, traffic, road conditions and trip characteristics we were able to identify trends in the data which seem to impact a cab trip duration.

The analysis shows that people are on the move a lot in New York (by taxi). The “quietest” times in New York City are in the wee hours of the morning on the weekends. Another interesting observation from our data is that majority of our pickups and drop off happens near JFK.


New Functions used in the analysis

In this section we will cover the new functions we have applied to complete our Analysis

Correlation calculation and p value

To generate an exhaustive set of correlation matrix and their significance we have use the rcorr function from the package Hmisc. For our training data set we required to understand the features which are highly correlated to each other before putting them in our model as that would result in fluctuation of our model result.

rcorr computes a matrix of Pearson’s r or Spearman’s rho rank correlation coefficients for all possible pairs of columns of a matrix. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. Ranks are computed using efficient algorithms, using midranks for ties.

One hot encoding for categorical variables

Our model had final 9 categorical variable which showed impact on our response variable trip duration. As linear regression model can’t directly take categorical variables as is we had to encode these in a format which the model can understand. One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction.This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

We have used the function dummyVars from the caret package to encode our categorical variables. This function produces a dummy variables data set. We utilized it to break the categorical variables into columns with values of either 0 or 1, whereas if a specific category was present for the ride ID it would be marked as 1, the rest are marked 0 otherwise.

Transforming our date time variable to different categories

In our train data we had 2 date fields - pickup date time and dropoff date time, from which needed to extract month, day, day of week and hours to understand the different trends observed in trip duration with time.

In our datasource the date time field were in character format. To convert these in readable datetime format we used As.POSIXct from the library lubridate. I addition to this we also used format function from base R package to extract the relevant information for our time-based analysis.

Statisical Indicators for road condition variable

We used the Profiling_num function from library funModeling to get a glimpse of statistical indicators for all numerical variables. We used it to get a quick digest list of mean, median, variability, quantiles, range, etc. for roads data.

To transform data to Matrix

We used the sapply function from the base R package to convert into matrix. sapply converts from data frames, vectors or lists to a matrix or vector. We took the traffic data vector to find null values and output a matrix with columns indicating how many nulls were present.